Enhancing Lemmatization for Mongolian and its Application to Statistical Machine Translation

نویسندگان

  • Odbayar Chimeddorj
  • Atsushi Fujii
چکیده

Lemmatization is crucial in natural language processing and information retrieval especially for highly inflected languages, such as Finnish and Mongolian. The state-of-the-art method of lemmatization for Mongolian does not need a noun dictionary and is scalable, but errors of this method are mainly caused by problems related to part of speech (POS) information. To resolve this problem, we integrate POS tagging and lemmatization for Mongolian. We evaluate the effectiveness of our method and its contribution to statistical machine translation.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Lemmatization Method for Modern Mongolian and its Application to Information Retrieval

In Modern Mongolian, a content word can be inflected when concatenated with suffixes. Identifying the original forms of content words is crucial for natural language processing and information retrieval. We propose a lemmatization method for Modern Mongolian and apply our method to indexing for information retrieval. We use technical abstracts to show the effectiveness of our method experimenta...

متن کامل

NRC Russian-English Machine Translation System for WMT 2016

We describe the statistical machine translation system developed at the National Research Council of Canada (NRC) for the Russian-English news translation task of the First Conference on Machine Translation (WMT 2016). Our submission is a phrase-based SMT system that tackles the morphological complexity of Russian through comprehensive use of lemmatization. The core of our lemmatization strateg...

متن کامل

Realignment from Finer-grained Alignment to Coarser-grained Alignment to Enhance Mongolian-Chinese SMT

The conventional Mongolian-Chinese statistical machine translation (SMT) model uses Mongolian words and Chinese words to practice the system. However, data sparsity, complex Mongolian morphology and Chinese word segmentation (CWS) errors lead to alignment errors and ambiguities. Some other works use finer-grained Mongolian stems and Chinese characters, which suffer from information loss when in...

متن کامل

Coping with Problems of Unicoded Traditional Mongolian

Traditional Mongolian Unicode Encoding has serious problems as several pairs of vowels with the same glyphs but different pronunciations are coded differently. We expose the severity of the problem by examples from our Mongolian corpus and propose two ways to alleviate the problem: first, developing a publicly available Mongolian input method that can help users to choose the correct encoding a...

متن کامل

Adapting Attention-Based Neural Network to Low-Resource Mongolian-Chinese Machine Translation

Neural machine translation (NMT) has shown very promising results for some resourceful languages like En-Fr and En-De. The success partly relies on the availability of large scale and high quality parallel corpora. We research on how to adapt NMT to very low-resource Mongolian-Chinese machine translation by introducing attention mechanism, sub-words translation, monolingual data and a NMT corre...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012